
cp: chore(fix): Deployment parallelism (2189) into r0.3.0#2406

Closed
ko3n1g wants to merge 1 commit into r0.3.0 from cherry-pick-2189-r0.3.0

Conversation


@ko3n1g ko3n1g commented Feb 17, 2026

beep boop [🤖]: Hi @ko3n1g 👋,

we've cherry-picked #2189 into  for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • Changes

    • Simplified deployment configuration by removing parallelism parameters.
    • Adjusted experiment name length constraints for specific cluster configurations.
  • Chores

    • Reformatted evaluation pipeline command for improved readability.

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

ko3n1g commented Feb 17, 2026

/ok to test 02b5f5d


copy-pr-bot bot commented Feb 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Feb 17, 2026

📝 Walkthrough

Walkthrough

Three configuration-related scripts are modified: parallelism arguments removed from evaluation deployment commands, the command string reformatted for readability, and the maximum experiment name length adjusted for dgxc clusters.

Changes

  • Evaluation Pipeline — examples/evaluation/deploy.sh, examples/evaluation/launch_evaluation_pipeline.py: Removed the tensor_model_parallel_size, pipeline_model_parallel_size, and context_parallel_size arguments from the deployment configuration. The command in the launch script was reformatted to a multi-line f-string with preserved functionality.
  • Performance Setup — scripts/performance/setup_experiment.py: Reduced the maximum experiment name length from 37 to 33 characters for dgxc cluster configurations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

r0.3.0

Suggested reviewers

  • erhoo82
  • malay-nagda
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check — ❓ Inconclusive. The title references PR #2189 and indicates a cherry-pick operation, but does not clearly describe the actual change: removing parallelism arguments from a deployment script. Resolution: consider simplifying the title to focus on the primary change, such as "Remove parallelism arguments from deployment script" or "Fix deployment configuration for r0.3.0".

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Test Results For Major Changes — ✅ Passed. The PR contains only minor changes: code formatting, a constraint adjustment on experiment name length, and a deployment configuration simplification. These are maintenance-level modifications that do not constitute major changes requiring comprehensive test result documentation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch cherry-pick-2189-r0.3.0





@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/evaluation/launch_evaluation_pipeline.py (1)

14-14: ⚠️ Potential issue | 🟡 Minor

Shebang line should be on line 1, before the copyright header.

The #!/usr/bin/env python3 shebang on line 14 is placed after the copyright comment block. For the OS to recognize the interpreter directive, it must be the very first line of the file. This is a pre-existing issue but worth fixing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/evaluation/launch_evaluation_pipeline.py` at line 14, Move the
interpreter directive to the very top of the script so the OS recognizes it:
place the line "#!/usr/bin/env python3" as line 1 of
launch_evaluation_pipeline.py (before any copyright or comment block), removing
the existing duplicate later in the file; ensure there are no blank lines or BOM
before the shebang.
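To illustrate why placement matters, here is a minimal, self-contained check; the file contents below are hypothetical examples, not code from this repository. The kernel only honors an interpreter directive when `#!` opens the very first line of the file.

```python
def starts_with_shebang(script_text: str) -> bool:
    """Return True if the interpreter directive is the very first line."""
    first_line = script_text.splitlines()[0] if script_text else ""
    return first_line.startswith("#!")

# Shebang first: the OS recognizes the interpreter directive.
good = "#!/usr/bin/env python3\n# Copyright header...\nprint('ok')\n"

# Shebang after the header: it is treated as an ordinary comment.
bad = "# Copyright header...\n#!/usr/bin/env python3\nprint('ok')\n"
```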
🧹 Nitpick comments (3)
examples/evaluation/launch_evaluation_pipeline.py (1)

114-123: Reformatted command looks correct; minor: missing space before | on line 118.

The multi-line reformatting is a readability improvement. However, there's a missing space before the pipe on line 118 ({args.num_gpus}| tee), unlike line 122 which has {args.parallelism} | tee. While bash handles this fine, consistency is nice.

Consistency fix
-            {args.num_gpus}| tee -a deploy.log & \
+            {args.num_gpus} | tee -a deploy.log & \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/evaluation/launch_evaluation_pipeline.py` around lines 114 - 123,
The f-string command block used for launching the evaluation pipeline has a
missing space before the pipe after {args.num_gpus} causing an inconsistency
(see the multiline f-string assigned to command in
launch_evaluation_pipeline.py); edit that string to insert a space so it reads
"{args.num_gpus} | tee -a deploy.log & \\" to match the later
"{args.parallelism} | tee -a eval.log" and keep formatting consistent.
scripts/performance/setup_experiment.py (1)

387-389: Consider extracting the magic number 33 into a named constant.

The truncation limit changed from 37 to 33, but the rationale for the specific value isn't documented. A named constant (e.g., DGXC_MAX_EXP_NAME_LENGTH = 33) near the top of the file would make the constraint self-documenting and easier to find if it needs to change again.

Proposed refactor

Near the top of the file (e.g., after line 60):

DGXC_MAX_EXP_NAME_LENGTH = 33

Then at line 388:

-        exp_name[:33] if dgxc_cluster is not None else exp_name
+        exp_name[:DGXC_MAX_EXP_NAME_LENGTH] if dgxc_cluster is not None else exp_name
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/setup_experiment.py` around lines 387 - 389, Replace the
magic number 33 with a named constant: define DGXC_MAX_EXP_NAME_LENGTH = 33 near
the top of the file (e.g., after imports or around line 60) and update the
truncation site that uses exp_name and dgxc_cluster to use
DGXC_MAX_EXP_NAME_LENGTH (exp_name = exp_name[:DGXC_MAX_EXP_NAME_LENGTH] if
dgxc_cluster is not None else exp_name); add a brief comment next to the
constant explaining the Kubernetes name-length constraint so future maintainers
can find and change it easily.
examples/evaluation/deploy.sh (1)

1-16: Missing shebang and NVIDIA copyright header.

The script lacks both a shebang line (#!/bin/bash) and the NVIDIA copyright header. While these are pre-existing omissions (not introduced by this PR), the coding guidelines require an NVIDIA copyright header on all shell scripts, and best practice calls for a shebang. Consider adding them in a follow-up.

As per coding guidelines: "Add NVIDIA copyright header to all Python files and shell scripts at the top of the file."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/evaluation/deploy.sh` around lines 1 - 16, Add a bash shebang and
the required NVIDIA copyright header at the top of this shell script (deploy.sh)
so it complies with coding guidelines; place the shebang (#!/bin/bash) as the
very first line and immediately below it insert the standard NVIDIA
copyright/header block used across the repo, then preserve the existing logic
that references MEGATRON_CHECKPOINT, NUM_REPLICAS, NUM_GPUS and the python
invocation without other changes.


ko3n1g commented Feb 26, 2026

will be merged via #2509

@ko3n1g ko3n1g closed this Feb 26, 2026
